library(apts.doe)
library(AlgDesign)
library(rsm)
library(dplyr)
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(plotly)
Registered S3 method overwritten by 'data.table':
method from
print.data.table
Registered S3 method overwritten by 'htmlwidgets':
method from
print.htmlwidget tools:rstudio
Attaching package: ‘plotly’
The following object is masked from ‘package:ggplot2’:
last_plot
The following object is masked from ‘package:stats’:
filter
The following object is masked from ‘package:graphics’:
layout
AlgDesignWe will study three examples of using AlgDesign to find designs under linear models \[
Y = X\beta + \epsilon\,,
\] with \(X\) the model matrix, \(\beta\) the \(p\)-vector of unknown model coefficients and \(\epsilon\) an \(n\)-vector of identically and independently normally distributed random errors, each having variance \(\sigma^2\).
A \(D\)-optimal design maximises the determinant of the information matrix \(X^T X\).
D-optimal design for the regression model
\[ Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \]
x <- seq(-1, 1, length = 101) # candidate list
dopt1 <- optFederov(~ quad(x), data = x, nTrials = 9, criterion = "D")
dopt1
$D
[1] 0.5088144
$A
[1] 3.139587
$Ge
[1] 0.941
$Dea
[1] 0.939
$design
$rows
[1] 1 2 3 50 51 52 99 100 101
The function returns
D: the value of the determinant of normalised information matrix (divided by \(n\)) raised to the power \(1/p\)A: the value of the trace of the inverse of the information matrix, divided by \(p\)Ge and Dea: the \(G\)- and \(D\)-efficiencies relative to the theoretical optimal design (which may not be realisable)rows: the rows of the candidate list that correspond to the optimal design.A pharmaceutical industry experiment (Owen et al., 2001, Organic Process Research and Development) investigated four factors in \(n=30\) runs. We will find a \(V\)-optimal design for this experiment, assuming
\[ -2 \le x_1, x_2, x_3, x_4 \le 2\,. \]
Note that AlgDesign refers to \(V\)-optimality as \(I\)-optimality (\(I\) for integrated variance).
cand.list <- expand.grid(x1 = seq(-2, 2, length = 5), x2 = seq(-2, 2, length = 5),
x3 = seq(-2, 2, length = 5), x4 = seq(-2, 2, length = 5))
dopt2 <- optFederov(~ quad(x1, x2, x3, x4), data = cand.list, nTrials = 30,
criterion = "I")
dopt2
$D
[1] 4.048093
$A
[1] 0.7305919
$I
[1] 11.31889
$Ge
[1] 0.556
$Dea
[1] 0.45
$design
$rows
[1] 1 5 21 25 65 73 103 111 124 178 263 264 273 311 313 326 355 371 375 501
[21] 505 521 525 554 561 575 601 613 615 622
It is interesting to compare the performance of this design under the various optimality criteria with the original design used in the study. This was a central composite design, or CCD, a common multi-factor design for fitting regression models. To do this we generate a CCD using the rsm package.
ccd_design
run.order std.order x1.as.is x2.as.is x3.as.is x4.as.is
1 1 1 -1 -1 -1 -1
2 2 2 1 -1 -1 -1
3 3 3 -1 1 -1 -1
4 4 4 1 1 -1 -1
5 5 5 -1 -1 1 -1
6 6 6 1 -1 1 -1
7 7 7 -1 1 1 -1
8 8 8 1 1 1 -1
9 9 9 -1 -1 -1 1
10 10 10 1 -1 -1 1
11 11 11 -1 1 -1 1
12 12 12 1 1 -1 1
13 13 13 -1 -1 1 1
14 14 14 1 -1 1 1
15 15 15 -1 1 1 1
16 16 16 1 1 1 1
17 17 17 0 0 0 0
18 18 18 0 0 0 0
19 19 19 0 0 0 0
20 20 20 0 0 0 0
21 21 21 0 0 0 0
22 22 22 0 0 0 0
23 1 1 -2 0 0 0
24 2 2 2 0 0 0
25 3 3 0 -2 0 0
26 4 4 0 2 0 0
27 5 5 0 0 -2 0
28 6 6 0 0 2 0
29 7 7 0 0 0 -2
30 8 8 0 0 0 2
Data are stored in coded form using these coding formulas ...
x1 ~ x1.as.is
x2 ~ x2.as.is
x3 ~ x3.as.is
x4 ~ x4.as.is
McNamara et al. (2005, Tech. Report) described an experiment to investigate the relationship between chemical yield and choice of solvent.
solvents <- read.csv("solvents.csv", header = T)
solvents
\(D\)-optimal design to estimate a quadratic regression model with \(n=20\)
optFederov requires the candidate list to have more rows than \(n\), so we use two replicates of the solvent listdopt3 <- optFederov(~ quad(dc, dp, ws), data = rbind(solvents, solvents, solvents),
nTrials = 20, criterion = "D")
with(dopt3$design, {
dopt3$design[order(solvent), ]
})
Notice that not all of the solvents have been selected in the design (so some have replication 0).
dopt3$design %>% count(solvent)
Compare this design to the \(D\)-optimal design that would be found if we assumed all possible combinations of descriptor values were possible (or a least a much wider range). That is, what design would be chosen if we were not restricted to our list of 13 solvents?
descrip.list <- expand.grid(dc = seq(-1, 1, len = 9), dp = seq(-1, 1, len = 9),
ws = seq(-1, 1, len = 9))
dopt3b <- optFederov(~ quad(dc, dp, ws), data = descrip.list, nTrials = 20,
criterion = "D")
with(dopt3b$design, {
dopt3b$design[order(dc, dp, ws), ]
})
This design has only one point in the interior of the design space (at the centre point), unlike the design constructed from the original candidate list. It has no replication.
dopt3b$design %>% count(dc, dp, ws)
round(100 * (dopt3$D/dopt3b$D) ^ (1 / 10))
[1] 82
Original design has low efficiency but is realisable (the solvents exist!), whereas we do not know if the combinations of the descriptors in the second design correspond to actual solvents.
plot_ly(dopt3$design %>% count(dc, dp, ws), x = ~dc, y = ~dp, z = ~ws, type = "scatter3d", mode = "markers", color = ~n)
plot_ly(dopt3b$design, x = ~dc, y = ~dp, z = ~ws, type = "scatter3d", mode = "markers")